Appl. No. 09/702,484

Amdt. Dated Feb. 27, 2004 

Response to Office Action of October 1, 2003 



Amendments to the Specification:

Rewrite the fifth paragraph on page 8 as follows: 






Fig. 4a depicts an example of a 32-bit Opcode instruction to perform a relative branch
with NOPs (BNOP) operation; and Fig. 4b depicts the pipeline format for performing a
relative BNOP operation; and

Rewrite the sixth paragraph on page 8 as follows: 

Fig. 5a depicts an example of a 32-bit Opcode instruction to perform an absolute BNOP
operation; and Fig. 5b depicts the pipeline format for performing an absolute BNOP operation.

Rewrite the first paragraph at the top of page 9 as follows: 

Data processing devices suitable for use with and incorporating this invention are
described in U.S. Patent Application Serial No. 09/703,096 entitled "Microprocessor with
Improved Instruction Set Architecture", which is incorporated herein by reference. In an
embodiment of the present invention, there are 64 general-purpose registers. General-purpose
registers A0, A1, A2, B0, B1 and B2 each may be used as a conditional register. Further, each
.D unit may load and store double words (64 bits). The .D units may access words and double
words on any byte boundary. The .D unit supports data as well as address cross paths. The same
register may be used as a data path cross operand for more than one functional unit in an
execute packet. A delay clock cycle is introduced when an instruction attempts to read a
register via a cross path that was updated in the previous cycle. Up to two long sources and two
long results may be accessed on each data path every cycle.
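
By way of illustration only, the cross-path delay condition stated above may be sketched as follows. The function name and the representation of registers as strings are assumptions made for this sketch and do not appear in the specification.

```python
def cross_path_stall_needed(written_previous_cycle, cross_path_reads):
    """Return True when a delay clock cycle would be introduced.

    written_previous_cycle: set of register names updated in the previous cycle.
    cross_path_reads: register names read via a cross path in the current cycle.
    Both arguments are hypothetical inputs chosen for this sketch.
    """
    return any(reg in written_previous_cycle for reg in cross_path_reads)


# B5 was updated in the previous cycle and is now read over a cross path.
print(cross_path_stall_needed({"B5"}, ["B5"]))
# B6 was not updated in the previous cycle.
print(cross_path_stall_needed({"B5"}, ["B6"]))
```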

Rewrite the first complete paragraph at page 10 as follows: 

In microprocessor 1 there are shown a central processing unit (CPU) 10, data memory 22,
program memory 23, peripherals 60 and an external memory interface (EMIF) with a direct
memory access (DMA) 61. CPU 10 further has an instruction fetch/decode unit 10a-c, a
plurality of execution units, including an arithmetic and load/store unit D1, a multiplier M1, an
ALU/shifter unit S1, an arithmetic logic unit ("ALU") L1, a shared multi-port register file 20a
from which data are read and to which data are written. Instructions are fetched by fetch unit
10a from instruction memory 23 over a set of busses 41. Decoded instructions are provided from
the instruction fetch/decode unit 10a-c to the functional units D1, M1, S1, and L1 over various
sets of control lines which are not shown. Data are provided to/from the register file 20a from/to
load/store unit D1 over a first set of busses 32a, to multiplier M1 over a second set of
busses 34a, to ALU/shifter unit S1 over a third set of busses 36a and to ALU L1 over a fourth set
of busses 38a. Data are provided to/from the memory 22 from/to the load/store unit D1 via
a fifth set of busses 40a. Note that the entire data path described above is duplicated with
register file 20b and execution units D2, M2, S2, and L2. Load/store unit D2 similarly interfaces
with memory 22 via a second set of busses. Instructions are fetched by fetch unit 10a from
instruction memory 23 over a set of busses 41. Emulation circuitry 50 provides access to the
internal operation of integrated circuit 1 which may be controlled by an external test/development
system (XDS) 51.

Rewrite the last paragraph at the bottom of page 10 and continuing to page 11 as follows:

When microprocessor 1 is incorporated in a data processing system, additional memory
or peripherals may be connected to microprocessor 1, as illustrated in Figure 1. For example,
Random Access Memory (RAM) 70, a Read Only Memory (ROM) 71 and a Disk 72 are shown
connected via an external bus 73. Bus 73 is connected to the External Memory Interface (EMIF)
which is part of functional block 61 within microprocessor 1. A Direct Memory Access
(DMA) controller is also included within block 61. The DMA controller part of functional block
61 connects to data memory 22 via a bus and is generally used to move data between memory
and peripherals within microprocessor 1 and memory and peripherals which are external to
microprocessor 1.

Rewrite the first complete paragraph at page 11 as follows:

Each functional unit reads directly from and writes directly to the register file within its
own data path. That is, the .L1, .S1, .D1, and .M1 units write to register file A 20a and the .L2,
.S2, .D2, and .M2 units write to register file B 20b. The register files are connected to the
opposite-side register file's functional units via the 1X and 2X cross paths. These cross paths
allow functional units from one data path to access a 32-bit operand from the opposite side's
register file. The 1X cross path allows data path A's functional units to read their source from
register file B. Similarly, the 2X cross path allows data path B's functional units to read their
source from register file A.

Insert the following new paragraph after the second complete paragraph at page 11 as
follows: 



S2 unit may write to control register file 102 from a dst output via a bus (not shown). S2
unit may read from control register file 102 to its src2 input via a bus (not shown).

Rewrite the second complete paragraph at page 12 as follows: 

Fig. 2 is a top level block diagram of an A unit group, which supports a portion of
the arithmetic and logic operations of DSP core 10. A unit group handles a variety of
operation types requiring a number of functional units including A adder unit 128, A zero detect
unit 130, A bit detection unit 132, A R/Z logic unit 134, A pack/replicate unit 136, A shuffle unit
138, A generic logic block unit 140, and A div-seed unit 142. Partitioning of the functional
sub-units is based on the functional requirements of A unit group, emphasizing maximum
performance while still achieving low power goals. There are two input muxes 144 and 146 for
the input operands, both of which allow routing of operands from one of five sources. Both
muxes have three hotpath sources from the A, C and S result busses, and a direct input from
register file RF in the primary datapath. In addition, src1 mux 144 may pass constant data
from decode unit 62 (not shown), while src2 mux 146 provides a path for operands from the
opposite datapath. Result mux 148 is split into four levels. Simple operations which complete
early in the clock cycle are pre-muxed in order to reduce loading on the critical final output mux.
A unit group also is responsible for handling control register operations 143. Although no
hardware is required, these operations borrow the read and write ports of A unit group for
routing data. The src2 read port is used to route data from register file (RF) to valid
configuration registers. Similarly, the write port is borrowed to route configuration register data
to register file RF.



Rewrite the last paragraph at the bottom of page 12 and continuing to page 13 as follows: 

Fig. 3 is a top level block diagram of S unit group, which is optimized to handle
shifting, rotating, and Boolean operations, although hardware is available for a limited set of add
and subtract operations. S unit group is unique in that most of the hardware may be directly
controlled by the programmer. S unit group has two more read ports than the A and C unit
groups, thus permitting instructions to operate on up to four source registers, selected through
input muxes 144, 146, 161, and 163. Similar to the A and C unit groups, the primary execution
functionality is performed in the Execute cycle of the design. S unit group has two major
functional units: 32-bit S adder unit 156, and S rotate/Boolean unit 165. S rotate/Boolean unit
165 includes S rotator unit 158, S mask generator unit 160, S bit replicate unit 167, S unpack/
sign extend unit 169, and S logical unit 162. The outputs from S rotator unit 158, S mask
generator unit 160, S bit replicate unit 167, and S unpack/sign extend unit 169 are forwarded to
S logical unit 162. The various functional units that make up S rotate/Boolean unit 165 may be
utilized in combination to make S unit group capable of handling very complex Boolean
operations. Finally, result mux 148 selects an output from one of the two major functional units,
S adder unit 156 and S rotate/Boolean unit 165, for forwarding to register file RF.



Rewrite the first complete paragraph at page 14 as follows: 

Although the method and apparatus of this invention may be used with either load or
branch instructions, the branch instructions tend to have more room to receive the additional
NOP field. Thus, in a method for reducing total code size during branching, the method may
comprise the steps of determining a latency in a shift between a first pipelined operation and a
second pipelined operation. The latency may be determined by identifying the branch instruction
and the first and second pipelined operations. Further, the method may conclude by adding a
NOP field to an end of the branch instruction, e.g., B label, 5. In determining the latencies
within a code, the code may be manually or automatically searched to locate sections of code, such
as branch operations, which will necessitate latencies or delays. Alternatively, a particular
program may be run and analyzed to determine the latencies within the program.

Rewrite the first complete paragraph at page 15 as follows: 



The invention will be further clarified by a consideration of the following examples, 
which are intended to be purely exemplary of the use of the invention. As demonstrated by the 
following examples, the NOP operation may be encoded into or onto the instruction, such that 
the NOP is an operation issued in parallel with the instruction requiring the latency. Referring to 
the examples set forth above, the following examples show the code rewritten according to the 
present invention: 
Example 1b:

LD *a0, a5, 4 % "4" (i.e., four (4) cycles or delay slots) is the NOP field

ADD a5, 6, a7 % a5 value available

Example 2b:

B label, 5 % "5" (i.e., five (5) cycles or delay slots) is the NOP field

; % branch occurs

As may be seen from these examples, the NOP field is an instruction operand that ranges from 0 
to the maximum latency of the instruction. Nevertheless, other ranges may be applied that may 
result in further savings on op-code encoding space. Another example is provided below for the 
LD instruction of Example 1b, in which a value less than maximum latency is used because other
instructions are to be scheduled in the instruction's delay slots.
Example 1c:

LD *a0, a5, 3 % "3" (i.e., three (3) cycles or delay slots) is the NOP field

ADD a3, 5, a3 % a new instruction is inserted into the 4th delay slot

ADD a5, 6, a7 % a5 value available
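
By way of illustration only, the rewriting shown in the examples above may be sketched as a pass that folds explicit NOP cycles into the NOP field of a preceding load or branch. The tuple format, the function name, and the table of maximum field values (four for LD and five for B, as suggested by Examples 1b and 2b) are assumptions made for this sketch.

```python
# Maximum NOP field per mnemonic, inferred from Examples 1b and 2b (assumption).
MAX_NOP_FIELD = {"LD": 4, "B": 5}


def fold_nops(program):
    """Fold explicit NOP cycles into the NOP field of a preceding LD or B.

    program: list of (mnemonic, operands, nop_field) tuples. For a ("NOP", n, 0)
    entry, `operands` is the integer number of idle cycles. Hypothetical format.
    """
    folded = []
    for mnemonic, operands, nop_field in program:
        prev = folded[-1] if folded else None
        if mnemonic == "NOP" and prev is not None and prev[0] in MAX_NOP_FIELD:
            room = MAX_NOP_FIELD[prev[0]] - prev[2]
            absorbed = min(room, operands)
            folded[-1] = (prev[0], prev[1], prev[2] + absorbed)
            if operands > absorbed:
                # Cycles that do not fit in the field remain an explicit NOP.
                folded.append(("NOP", operands - absorbed, 0))
        else:
            folded.append((mnemonic, operands, nop_field))
    return folded


before = [("LD", "*a0, a5", 0), ("NOP", 4, 0), ("ADD", "a5, 6, a7", 0)]
for instruction in fold_nops(before):
    print(instruction)
```

In this sketch the three-entry sequence collapses to two entries, with the load carrying a NOP field of 4, which corresponds to the form of Example 1b.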

Rewrite the last paragraph at the bottom of page 15 and continuing to page 16 as follows: 



In still another embodiment of the invention, the latency may be identified within a
Branch instruction performing a relative branch with NOPs, i.e., a BNOP. An operation code or
Opcode may be the first byte of the machine code that describes a particular type of operation
and the combination of operands to the central processing unit (CPU). For example, the Opcode
for the BNOP instruction may be formed by the combination of a BNOP (.unit) code coupled
with the identification of a starting source (src2) and an ending source (src1) code, e.g.,
.unit = .S1, .S2. In this format, the src2 Opcode map field is used for the scst12 operand-type unit
to perform a relative branch with NOPs using the 12-bit signed constant specified by src2. The
constant is shifted two (2) bits to the left, then added to the address of the first instruction of the
fetch packet that contains the BNOP instruction. Referring to Fig. 4a, an example of a 32-bit
Opcode is depicted showing the incorporation of instructions relating to src2 and src1 of the BNOP
instruction.
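
By way of illustration only, the address computation described above may be sketched as follows. The function names and the wrap of the result to a 32-bit address are assumptions made for this sketch and are not taken from the specification.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def relative_bnop_target(fetch_packet_address, scst12):
    """Branch target for a relative BNOP.

    fetch_packet_address: address of the first instruction of the fetch packet
    that contains the BNOP instruction.
    scst12: the raw 12-bit signed constant held in the src2 field.
    """
    offset = sign_extend(scst12, 12) << 2  # shift two bits to the left
    return (fetch_packet_address + offset) & 0xFFFFFFFF  # 32-bit wrap is an assumption


print(hex(relative_bnop_target(0x1000, 0x004)))  # forward branch
print(hex(relative_bnop_target(0x1000, 0xFFF)))  # constant of -1, backward branch
```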



Rewrite the last paragraph at the bottom of page 16 and continuing to page 17 as follows:

Only one branch instruction may be executed per cycle. If two (2) branch condition
controls are in the same execute packet, i.e., a block of instructions that execute in parallel, and if
both are accepted, the program behavior is undefined. Further, when a predicated BNOP
instruction is used with a NOP count greater than five (5), a C64X processor, available from
Texas Instruments, Inc., of Dallas, Texas, will insert the total number of delay slots requested,
only when the predicated condition is false. For example, the following set of instructions inserts
seven (7) cycles of NOPs into the BNOP instruction:

ZERO .L1 A0
[A0] BNOP .S1 LABEL,7

Thus, the branch is not taken, and seven (7) cycles of NOPs are inserted. Conversely, when a
predicated BNOP instruction is used with a NOP count greater than five (5) and the predication
condition is true, the branch will be taken and the multi-cycle NOP will be simultaneously
terminated. For example, the following set of instructions inserts only five (5)
cycles of NOPs into the BNOP instruction:

MVK .D1 1,A0
[A0] BNOP .S1 LABEL,7

Thus, the branch is taken, and five (5) cycles of NOPs are effectively inserted.
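
By way of illustration only, the predicated behavior described above may be sketched as follows. The function name and the modeling of the predicate register as an integer are assumptions made for this sketch; the limit of five (5) cycles is taken from the text above.

```python
BRANCH_DELAY_SLOTS = 5  # cycles after which a taken branch ends the multi-cycle NOP


def predicated_bnop(predicate_register_value, nop_count):
    """Return (branch_taken, nop_cycles_inserted) for a predicated BNOP.

    predicate_register_value: value of the predicate register, e.g. A0.
    A non-zero value is treated as a true condition (assumption of this sketch).
    nop_count: the NOP count requested in the instruction.
    """
    branch_taken = predicate_register_value != 0
    if branch_taken:
        # The taken branch terminates the multi-cycle NOP.
        return True, min(nop_count, BRANCH_DELAY_SLOTS)
    # Condition false: the total number of requested delay slots is inserted.
    return False, nop_count


print(predicated_bnop(0, 7))  # A0 cleared by ZERO, branch not taken
print(predicated_bnop(1, 7))  # A0 set to 1 by MVK, branch taken
```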

Rewrite the last paragraph at the bottom of page 17 and continuing to page 18 as follows:

In yet another embodiment of this invention, an operation code or Opcode again may be
the first byte of the machine code that describes a particular type of operation and the
combination of operands to the central processing unit (CPU). For example, the Opcode for the
BNOP instruction again may be formed by the combination of a BNOP (.unit) code coupled with
the identification of a second source (src2) and a first source (src1) code, e.g.,
.unit = .S2. In this format, the src2 Opcode map field is used for the xunit operand-type unit to
perform an absolute branch with NOPs. The register specified in src2 is placed in the program
fetch counter (PFC), described above. The 3-bit unsigned constant specified in src1 provides
the number of delay slot NOPs to be inserted, e.g., from zero (0) to five (5). Thus, for
example, with src1 = 0, no delay slot NOPs are inserted. Consequently, this instruction also
reduces the number of instructions required to perform a branch operation when NOPs are
required to fill the delay slots of a branch. Referring to Fig. 5a, an example of a 32-bit Opcode
is depicted showing the incorporation of instructions relating to src2 and src1.
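
By way of illustration only, the operand handling described above may be sketched as follows. The dictionary model of the register file, the function name, and the example register contents are assumptions made for this sketch; no bit positions of the Opcode are implied.

```python
def absolute_bnop(register_file, src2, src1):
    """Return (new_pfc, delay_slot_nops) for an absolute BNOP.

    register_file: mapping of register name to contents (hypothetical model).
    src2: name of the register whose contents are placed in the PFC.
    src1: the 3-bit unsigned constant giving the number of delay slot NOPs.
    """
    if not 0 <= src1 <= 0b111:
        raise ValueError("src1 must fit in a 3-bit unsigned field")
    new_pfc = register_file[src2]
    return new_pfc, src1


registers = {"B3": 0x00002000}  # example contents, chosen for illustration
print(absolute_bnop(registers, "B3", 0))  # src1 = 0, no delay slot NOPs inserted
print(absolute_bnop(registers, "B3", 5))
```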



Rewrite the last paragraph at the bottom of page 18 and continuing to page 19 as follows: 

As noted above, only one branch instruction may be executed per cycle. If two (2)
branch condition controls are in the same execute packet and if both are accepted, the program
behavior is undefined. Further, when a predicated BNOP instruction is used with a NOP count
greater than five (5), a C64X processor, available from Texas Instruments, Inc., of Dallas, Texas,
will insert the total number of delay slots requested, only when the predicated condition is false.
For example, the following set of instructions inserts seven (7) cycles of NOPs into the BNOP
instruction:

ZERO .L1 A0
[A0] BNOP .S1 B3,7

Thus, the branch is not taken, and seven (7) cycles of NOPs are inserted. Conversely,
when a predicated BNOP instruction is used with a NOP count greater than five (5) and the
predication condition is true, the branch will be taken and the multi-cycle NOP will be
simultaneously terminated. For example, the following set of instructions inserts
only five (5) cycles of NOPs into the BNOP instruction:

MVK .D1 1,A0
[A0] BNOP .S1 B3,7

Thus, the branch is taken, and five (5) cycles of NOPs are effectively inserted.






